METHOD AND APPARATUS
FOR COMPLEXITY SCALABLE CODEC

The present invention is directed towards video coders and/or decoders
(CODECs), and more particularly towards an apparatus and method for scalable
complexity CODECs.

It is desirable for a broadcast video application to provide support for diverse
user devices, without incurring the bitrate penalty associated with simulcast
encoding. Video decoding is a complex operation, and the complexity is very
dependent on the resolution of the coded video. Low power portable devices
typically have very strict complexity restrictions and low resolution displays.
Simulcast broadcast of two or more video bitstreams corresponding to different
resolutions can be used to address the complexity requirements of the lower
resolution devices, but requires a higher total bitrate than the complexity scalable
system of this invention. This invention provides a solution that allows for complexity
scalable decoders while maintaining high video coding bitrate efficiency.

Many different methods of scalability have been widely studied and
standardized, including SNR scalability, spatial scalability, temporal scalability, and
fine grain scalability, in scalability profiles of the MPEG-2 and MPEG-4 standards
[1], [2], [4]. Most of the work in scalable coding has been aimed at bitrate scalability,
where the low resolution layer has a limited bandwidth. Figure 1 shows a typical
spatial scalability system, where low resolution decoders are connected to a low
bandwidth network, and high resolution decoders are connected to a high bandwidth
network. Scalable coding has not been widely adopted in practice, because of the
considerable increase in encoder and decoder complexity, and because the coding
efficiency of scalable encoders is typically well below that of non-scalable encoders.

Spatially scalable encoders and decoders typically require that the high
resolution scalable encoder/decoder provide more functionality than would be
present in a normal high resolution encoder/decoder. In an MPEG-2 spatial scalable
encoder, a decision is made whether prediction is performed from a low resolution
picture or from a high resolution reference picture. An MPEG-2 spatial scalable
decoder must be capable of predicting either from the low resolution picture or the
high resolution picture. Two sets of reference picture stores are required by an
MPEG-2 spatial scalable encoder/decoder, one for low resolution pictures and
another for high resolution pictures. Figure 2 shows a block diagram of a spatial
scalable encoder supporting two layers. Figure 3 shows a block diagram of a spatial
scalable decoder supporting two layers.

In Figure 2, a high resolution input video sequence is received. It is
downsampled to create a low resolution video sequence. The low resolution video
sequence is encoded using a normal low resolution video compression encoder,
creating a low resolution bitstream. The low resolution bitstream is decoded using a
normal low resolution video compression decoder. (This function may be performed
inside of the encoder.) The decoded low resolution sequence is upsampled, and
provided as one of two inputs to a scalable high resolution encoder. The scalable
high resolution encoder encodes the video to create a high resolution scalable
bitstream.

In Figure 3, both a high resolution scalable bitstream and a low resolution
bitstream are received. The low resolution bitstream is decoded using a normal low
resolution video compression decoder, which utilizes low resolution frame stores.
The decoded low resolution video is upsampled, and then input into a high resolution
scalable decoder. The high resolution scalable decoder utilizes a set of high
resolution frame stores, and creates the high resolution output video sequence.

Figure 4 shows a block diagram of a typical non-scalable video encoder used
in the H.264/MPEG AVC standard [3]. Figure 5 shows a block diagram of a typical
non-scalable video decoder used with H.264/MPEG AVC. In an earlier filed provisional
patent application [5], it was proposed that H.264/MPEG AVC be extended to use a
Reduced Resolution Update (RRU) mode. The RRU mode improves coding
efficiency at low bitrates by reducing the number of residual macroblocks (MBs) to be
coded, while performing motion estimation and compensation of full resolution pictures.
Figure 6 shows an RRU video encoder. Figure 7 shows an RRU video decoder.




PU040098 

[1] MPEG-2, ISO/IEC 13818-2, "Generic coding of moving pictures and associated audio
information: Video"

[2] MPEG-4, ISO/IEC 14496-2:1999, "Coding of audio-visual objects"

[3] Wiegand, "Draft ITU-T Recommendation and Final Draft International Standard
of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," Mar 31, 2003.

[4] F. Wu, S. Li, R. Yan, X. Sun, and Y. Zhang, "Efficient and Universal Scalable Video Coding,"
ICIP2002.

[5] A. Tourapis and J. Boyce, "Reduced Resolution Slice Update Mode for Advanced Video
Coding," PU040073, U.S. Provisional Patent Application No. 60/551,417 filed on March 9,
2004.

The present invention is useful in that it enables a broadcast video system
with diverse user endpoint devices, while maintaining coding efficiency. Without loss
of generality, consider a system which supports two different levels of decoder
complexity and resolution. A low resolution decoder has a smaller display size and
has very strict decoder complexity constraints. A full resolution decoder has a larger
display size and less strict but still important decoder complexity constraints.

A broadcast or multicast system transmits two bitstreams, a base layer with
bitrate BR_base and an enhancement layer with bitrate BR_enhan. The two bitstreams
may be multiplexed together and sent in a single transport stream. Figure 8
illustrates a complexity scalability broadcast system, which includes a complexity
scalable video encoder, a low resolution decoder, and a full resolution decoder.
The low resolution decoder processes only the base layer bitstream and the full
resolution decoder processes both the base layer bitstream and the enhancement
layer bitstream.

A key goal of this system is to minimize BR_base + BR_enhan. This differs
somewhat from a typical scalability system where minimizing BR_base itself is also
considered important, as shown in Figure 1 where the low resolution devices are
connected via a low bandwidth network. In the complexity scalability system, it is
assumed that both the base layer and the enhancement layer are broadcast, so the
bitrate of the base layer bitstream itself is not necessarily as highly constrained.

In accordance with the principles of the present invention, the bits used for
coding of the video residual formed after motion estimation/compensation are used
in both the low resolution decoder and the full resolution decoder. The motion
vectors (mvs) transmitted in the base layer bitstream are used in both the low
resolution decoder and the full resolution decoder, but with a higher accuracy in the
full resolution decoder than in the low resolution decoder. Also, the motion
compensation prediction is done at the low resolution in the low resolution decoder
and at the full resolution in the full resolution decoder. Similarly to what is done in
the RRU codec of Figures 6 and 7, the motion blocks at the low resolution
correspond to larger blocks at the full resolution. So, when applied to the
H.264/MPEG AVC codec, for example, the allowable motion block sizes of 16x16,
16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 are used in the low resolution base layer, but
correspond to larger block sizes of 32x32, 32x16, 16x32, 16x16, 16x8, 8x16, and
8x8, respectively, at the full resolution.
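
The block size correspondence above can be illustrated with a minimal sketch. It
assumes a resolution ratio of 2 in each dimension; the function name and the ratio
parameter are hypothetical and are not taken from H.264/MPEG AVC.

```python
# Motion block sizes (width, height) allowed in the low resolution base layer.
LOW_RES_BLOCK_SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def full_res_block_size(low_res_size, ratio=2):
    """Return the full resolution block covered by a low resolution motion block.

    ratio is the assumed resolution ratio per dimension (hypothetical parameter).
    """
    width, height = low_res_size
    return (width * ratio, height * ratio)


for size in LOW_RES_BLOCK_SIZES:
    # Each base layer block maps to a block twice as wide and twice as tall.
    print(size, "->", full_res_block_size(size))
```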

The low resolution decoder uses only the base layer bitstream. An additional
enhancement layer bitstream is also transmitted, e.g. using 16x16 macroblocks, for
use in the full resolution decoder. The enhancement layer bitstream includes a full
resolution error signal, to be added to the result of decoding of the base layer
bitstream, which was done with full resolution motion compensation. The bitrate of
the enhancement layer may end up being lower than that of the base layer, which
differs from the typical spatial scalability case where the base layer bitrate is typically
small compared with the enhancement layer bitrate. A full resolution error signal is
not necessarily sent for every coded macroblock or slice/picture.

Figure 9 shows a block diagram of a low resolution decoder in accordance
with the principles of the present invention. The base layer bitstream is entropy
decoded. The motion vectors are rounded to reduce them in accuracy to correspond
to the low resolution. The remaining blocks are identical to those found in a standard
video decoder, including inverse quantization and inverse transform, motion
compensation, and deblocking filter. The complexity of this low resolution scalable
decoder is very similar to that of a non-scalable decoder, as scaling of motion
vectors is of very low complexity. If factors of 2 are used in the resolution ratios in
each dimension between the low and full resolution, the rounding can be
implemented with just a right shift or an add and a right shift, depending on whether
rounding up or rounding down is selected in the system.
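
As a minimal sketch of the rounding described above, assuming a ratio of 2 per
dimension and motion vector components stored as signed integers, the two variants
could look as follows. The function names are hypothetical.

```python
def mv_to_low_res_round_down(mv):
    """Halve a motion vector component with a single right shift.

    Python's >> on integers is an arithmetic shift, so this rounds toward
    negative infinity (floor) for both positive and negative components.
    """
    return mv >> 1


def mv_to_low_res_round_up(mv):
    """Halve a motion vector component with an add and a right shift.

    Adding 1 before the shift rounds halves upward (ceiling of mv / 2).
    """
    return (mv + 1) >> 1


print(mv_to_low_res_round_down(5), mv_to_low_res_round_up(5))
print(mv_to_low_res_round_down(-5), mv_to_low_res_round_up(-5))
```

Whichever variant is selected, the encoder and the low resolution decoder would have
to agree on it so that both form the same prediction.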

In an alternative embodiment of the present invention, the motion vectors
transmitted in the base layer are not of the higher resolution. In this case, the low
resolution decoder can be completely backwards compatible with an existing coding
standard. However, such a system may be of lower coding efficiency, as the
additional bit accuracy of the motion vectors for the full resolution is transmitted in
the enhancement layer bitstream. In this case the enhancement layer could be
coded similarly to a P slice, and motion vectors are differentially coded first based on
layer prediction (i.e. differentially coded versus the corresponding low resolution
layer mv), and secondly using spatial prediction (i.e. differentially coded versus
adjacent mvs or even versus adjacent differential mvs).
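
The two-stage differential coding of this alternative embodiment might be sketched as
follows. This is an illustration under stated assumptions only: one motion vector
component per block, a resolution ratio of 2, a layer predictor formed by scaling
the base layer mv by that ratio, and a spatial predictor equal to the previous
block's layer differential. None of these choices is specified in the text, and the
function names are hypothetical.

```python
def encode_enhancement_mvs(full_res_mvs, base_mvs, ratio=2):
    """Differentially code full resolution mvs against the base layer.

    Stage 1 (layer prediction): subtract the scaled base layer mv.
    Stage 2 (spatial prediction): subtract the previous block's differential.
    """
    coded = []
    prev_layer_diff = 0  # assumed predictor for the first block
    for full_mv, base_mv in zip(full_res_mvs, base_mvs):
        layer_diff = full_mv - ratio * base_mv
        coded.append(layer_diff - prev_layer_diff)
        prev_layer_diff = layer_diff
    return coded


def decode_enhancement_mvs(coded, base_mvs, ratio=2):
    """Invert encode_enhancement_mvs to recover the full resolution mvs."""
    full_res_mvs = []
    prev_layer_diff = 0
    for value, base_mv in zip(coded, base_mvs):
        layer_diff = value + prev_layer_diff
        full_res_mvs.append(ratio * base_mv + layer_diff)
        prev_layer_diff = layer_diff
    return full_res_mvs


base = [3, 4, 4]
full = [7, 9, 8]
coded = encode_enhancement_mvs(full, base)
print(coded)
print(decode_enhancement_mvs(coded, base))
```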

Figure 10 shows a block diagram of a full resolution decoder in accordance
with the present invention. The portion of the decoder that operates on the base
layer bitstream is similar to an RRU decoder. After entropy decoding and inverse
quantization and inverse transform, the residual is upsampled. Motion compensation
is applied to the full resolution reference pictures to form a full resolution prediction,
and the upsampled residual is added to the prediction. If a full resolution error signal
is present in the enhancement layer bitstream, it is entropy decoded, inverse
quantized and inverse transformed, and then added to the RRU reconstructed signal. The
deblocking filter is then applied. Presence of a full resolution error signal could be
signaled at the macroblock level with the use of a Skip macroblock mode. If a
macroblock is marked as skipped, no additional error signal is present; if not,
the delta_quant, the coded block pattern and the actual residual also have to be
transmitted. Skip macroblocks could also be run-length coded to further increase
efficiency. An additional intra directional prediction mode may be created that
performs no directional prediction. Although it may be more efficient to not perform
any additional prediction if a macroblock in the enhancement layer is skipped,
additional prediction could also be inferred by considering adjacent macroblocks. For
example, if all intra prediction modes as described in H.264 are available, then an
additional prediction for skip could also be generated, derived from the
prediction modes of the adjacent macroblocks (i.e. minimum directional prediction),
which is then added to the RRU reconstructed signal to generate the final prediction.
Similarly, an additional direct intra mode could also be used, which could also derive
its directional prediction mode from adjacent macroblocks, while still allowing the
transmission of an error signal.
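
The reconstruction path of the full resolution decoder can be sketched as below.
This is a simplified illustration, not the decoder of Figure 10: blocks are plain
lists of integer samples, the upsampler is a zero order hold with a ratio of 2,
entropy decoding, inverse quantization and inverse transform are assumed to have
already produced the residual and error arrays, and clipping and deblocking are
omitted. All function names are hypothetical.

```python
def upsample_zero_order_hold(block, ratio=2):
    """Repeat every sample ratio times horizontally and vertically."""
    out = []
    for row in block:
        wide = [value for value in row for _ in range(ratio)]
        for _ in range(ratio):
            out.append(list(wide))
    return out


def reconstruct_full_res(prediction, base_residual, enh_error=None, ratio=2):
    """Form the full resolution reconstruction for one block.

    prediction    : full resolution motion compensated prediction
    base_residual : decoded low resolution residual from the base layer
    enh_error     : decoded full resolution error signal, or None when the
                    enhancement layer macroblock is skipped
    """
    upsampled = upsample_zero_order_hold(base_residual, ratio)
    # RRU-style reconstruction: prediction plus upsampled residual.
    recon = [[p + r for p, r in zip(p_row, r_row)]
             for p_row, r_row in zip(prediction, upsampled)]
    if enh_error is not None:
        # Add the full resolution error signal when one was transmitted.
        recon = [[x + e for x, e in zip(x_row, e_row)]
                 for x_row, e_row in zip(recon, enh_error)]
    return recon


prediction = [[100, 100], [100, 100]]
print(reconstruct_full_res(prediction, [[4]]))
print(reconstruct_full_res(prediction, [[4]], [[1, 0], [0, -1]]))
```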

A key difference in this architecture from a traditional spatial scalable decoder is that
there is no need for two sets of reference picture stores and motion compensation
units. This full resolution decoder contains only full resolution reference picture
stores and only performs motion compensation once, at the full resolution. In
contrast, the spatial scalability decoder of Figure 3 includes both full resolution and
low resolution reference picture stores, and performs motion compensation at both
the full resolution and the low resolution. This leads to a significant reduction in
computations, memory, and memory bandwidth for full resolution decoders in
accordance with this invention as compared to traditional spatial scalable decoders.

The decoder complexity of the full resolution scalable decoder is similar to
that of a normal video decoder of the same resolution. The inverse quantization and
inverse transform blocks for the base layer bitstream are of lower complexity, as they
operate on fewer blocks than a normal decoder. However, additional entropy decoding
and inverse quantization and inverse transform are used for the enhancement layer
bitstream. The motion compensation and the deblocking filter, which are the most
computationally complex blocks of a decoder, are unchanged from a normal
decoder.

In an embodiment of the present invention, an enhancement layer bitstream full
resolution error signal is only sent when intra-coded (I) slices are present in the base
layer. Limiting the use of the enhancement layer to I slices limits the decoder
complexity for software implementations. I slices generally require fewer
computations than P and B slices, and hence there should be spare CPU cycles
available for the additional entropy decode and inverse quantization and inverse
transform operations.

Figure 11 shows an example of a Complexity Scalable Video Encoder. This encoder
attempts to optimize the full resolution video quality rather than the low resolution
video quality. Motion estimation is performed on the full resolution video picture.
After subtracting the motion compensated prediction from the input picture, the
prediction residual is downsampled. Unlike in the RRU codec, the downsampling is
applied to all pictures, so that the low resolution decoder can always have a picture
to decode. The downsampled residual is transformed, quantized, and entropy
coded. This forms the base layer bitstream. The inverse quantizer and inverse
transform are applied, and then the coded residual is upsampled back to the full
resolution. The encoder can choose whether or not to send an enhancement layer
full resolution error signal for the picture or slice. In general, an enhancement layer
full resolution error signal is coded for all I slices, and can be optionally sent for P
and B slices based on the magnitude of the error signal obtained when the decoded,
upsampled picture is subtracted from the full resolution input picture. If an
enhancement layer full resolution error signal is to be coded, the coded, upsampled
base layer picture is subtracted from the input full resolution picture. The difference
is then transformed, quantized, and entropy coded to form the enhancement layer
bitstream. The enhancement layer bitstream can be seen as containing only
intra-coded slices.
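
The encoder data flow above can be sketched for a single block as follows. This is a
rough illustration under several assumptions that do not come from the text: a
ratio of 2, a 2x2 mean downsampler, a zero order hold upsampler, a uniform scalar
quantizer standing in for the transform and quantizer, no entropy coding, and a
sum-of-absolute-values threshold for deciding whether to send the error signal.
All names and parameters are hypothetical.

```python
def downsample_2x2_mean(block):
    """Average each 2x2 group of samples (integer floor division)."""
    return [[(block[y][x] + block[y][x + 1]
              + block[y + 1][x] + block[y + 1][x + 1]) // 4
             for x in range(0, len(block[0]), 2)]
            for y in range(0, len(block), 2)]


def upsample_zero_order_hold(block):
    """Repeat every sample twice horizontally and vertically."""
    out = []
    for row in block:
        wide = [value for value in row for _ in range(2)]
        out.append(list(wide))
        out.append(list(wide))
    return out


def quantize(value, step):
    """Uniform scalar quantizer, rounding the magnitude to the nearest level."""
    sign = -1 if value < 0 else 1
    return sign * ((abs(value) + step // 2) // step)


def encode_block(input_block, prediction, step=4, send_threshold=0):
    """Return (base layer levels, enhancement error or None) for one block."""
    # Full resolution prediction residual, then downsample it.
    residual = [[i - p for i, p in zip(i_row, p_row)]
                for i_row, p_row in zip(input_block, prediction)]
    low_res = downsample_2x2_mean(residual)
    levels = [[quantize(v, step) for v in row] for row in low_res]
    # Local decode: inverse quantize, upsample, add to the prediction.
    coded = [[level * step for level in row] for row in levels]
    upsampled = upsample_zero_order_hold(coded)
    base_recon = [[p + r for p, r in zip(p_row, r_row)]
                  for p_row, r_row in zip(prediction, upsampled)]
    # Full resolution error between the input and the base reconstruction.
    error = [[i - b for i, b in zip(i_row, b_row)]
             for i_row, b_row in zip(input_block, base_recon)]
    magnitude = sum(abs(e) for row in error for e in row)
    return levels, (error if magnitude > send_threshold else None)


levels, enh = encode_block([[110, 110], [110, 110]], [[100, 100], [100, 100]])
print(levels, enh)
```
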
In an alternative embodiment, a joint optimization of both the low resolution
and full resolution pictures could take place. That would require the addition of a full
low resolution decoder model inside of the scalable encoder, low resolution
reference picture stores, and an additional low resolution motion estimation block.
Any of several different upsampling and downsampling filters can be used, for
example bilinear interpolation, zero order hold, or multi-tap filters.
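
To illustrate the difference between two of the filters named above, the following
sketch upsamples a single row of samples by a factor of 2. The sample alignment, the
rounding offset and the edge handling (repeating the last sample) are assumptions
made for this example; a two-dimensional bilinear filter would apply the same idea
along both axes.

```python
def upsample_row_zero_order_hold(row):
    """Repeat each sample twice."""
    return [value for value in row for _ in range(2)]


def upsample_row_linear(row):
    """Keep each sample and insert the rounded average of neighbouring samples.

    The last sample is repeated at the right edge (assumed edge handling).
    """
    out = []
    for i, value in enumerate(row):
        right = row[i + 1] if i + 1 < len(row) else value
        out.append(value)
        out.append((value + right + 1) >> 1)
    return out


print(upsample_row_zero_order_hold([0, 4, 8]))
print(upsample_row_linear([0, 4, 8]))
```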

Additional deblocking filters could be added in the full resolution decoder and
the scalable encoder, prior to the addition of the enhancement layer error signal.
Deblocking could in this case also consider the enhancement layer macroblock
modes used, i.e. if all affected blocks are skipped, no additional deblocking is
applied; otherwise different strength filtering is applied depending on whether the
upscaling was performed on the residual or the low resolution reconstructed block.

There is more than one possible method to use for intra prediction in the full
resolution decoder, when applied to H.264/MPEG AVC. Intra prediction could be
applied at the low resolution, using the same prediction pixels as in the H.264/MPEG
AVC spec. Alternatively, the method described in U.S. Provisional Patent
Application No. 60/551,417 (Docket Number PU040073), filed on March 9, 2004,
could be used, in which the intra prediction is applied at the full resolution, and a
larger number of pixels at the full resolution are used in the prediction.

In an alternative embodiment, the full resolution decoder may decide to
perform motion compensation for a macroblock using the same resolution and
method as for the base layer decoding (i.e. using 16x16 macroblocks), the result of
which is then upsampled to full resolution. Upsampling could be performed using a
bilinear or longer tap filter. A full resolution error signal could also be added, if
available. The decision could be made through additional signaling at the macroblock
level (i.e. with the presence of an RRU macroblock mode, and a low resolution
macroblock mode, apart from SKIP mode). This process may be desirable for certain
cases where, due to high motion and texture detail, upsampling the residual would
lead to the generation of undesirable high frequencies and artifacts. Nevertheless,
this would also require that the full resolution decoder be able to store, or generate
on the fly, the low resolution references. The longer tap filter could also incur further
complexity, although this is partly compensated by the fact that motion
compensation is performed for a smaller macroblock. A second, simpler alternative
solution to the same problem, however, is to perform motion compensation at full
resolution, and to entropy decode, inverse quantize and inverse transform the base
layer residual but not add it to the motion compensated signal, prior to finally adding
the full resolution error. This method requires decoding of the base layer residual in
order to update the entropy context model for the decoding of the remaining
residuals. This latter solution could replace the low resolution macroblock mode, or
co-exist as an additional mode for the encoding of the full resolution residual.

The above description and figures assume two layers of scalability; however, this
concept can be extended to an arbitrary number of layers.

Advantages and Advancements 

1. Video decoder in which decoded motion vectors are reduced in accuracy
while not reducing the resolution of the reference picture prediction. 

2. Scalable video decoder where a base layer prediction residual is upsampled, 
added to the full resolution motion compensated prediction, and added to a 
decoded full resolution enhancement layer error signal. 

3. Decoder of item 2 in which enhancement layer error signal is intra coded. 

4. Application of a deblocking filter to the decoder of item 2, following the 
addition of the error signal. 

5. Deblocking filter of item 4 that also considers the enhancement layer mode 
signals. 

6. Scalable video encoder which creates a base layer and an enhancement 
layer, with the enhancement layer created only for intra-coded slices in the 
base layer. 

7. Scalable video encoder which creates a base layer and an enhancement 
layer, the base layer containing a coded low resolution prediction residual and 
full resolution motion vectors, and the enhancement layer containing a coded 
error signal formed by subtracting from the input picture the result of the 
decoder in item 2. 

8. Scalable video decoder containing a full resolution reference picture store and 
full resolution motion compensation block, a lower resolution inverse quantizer 
and inverse transform, and an upsampler. 
9. Scalable video decoder which processes both a base layer bitstream and an
enhancement layer bitstream but includes only full resolution reference picture
stores and not low resolution reference picture stores.

10. Decoder of item 2 where optionally, i.e. based on an enhancement layer signal,
the base layer prediction residual is not added to the full resolution motion
compensated prediction. This is then added to the decoded full resolution
enhancement layer error signal.

11. Decoder of item 2 where optionally, i.e. again based on an enhancement layer
signal, the base layer prediction residual is added to the low resolution motion
compensated prediction. This is then upsampled and added to the decoded
full resolution enhancement layer error signal.

12. Decoder of item 2 where the enhancement layer error signal contains skipped
residual macroblocks, and macroblock modes signaling how the prediction signal
will be generated based on items 3, 10, and 11.

Figure 1. Typical Spatial Scalability System

Figure 2. Spatial Scalable Encoder
[Block diagram. Labeled blocks: High Res Input Video Sequence, Downsampler,
Low Res Normal Encoder, Low Res Normal Decoder, Upsampler, Scalable High Res
Encoder. Outputs: High Res Scalable Bitstream, Low Res Bitstream.]

Figure 3. Spatial Scalable Decoder

Figure 4. Standard Video Encoder

Figure 5. Standard Video Decoder

Figure 6. RRU Video Encoder

Figure 7. RRU Video Decoder

Figure 8. Complexity Scalability Broadcast System

Figure 9. Low Res Complexity Scalable Decoder

Figure 10. High Res Complexity Scalable Decoder

Figure 11. Complexity Scalable Video Encoder



